class: center, middle

<style>
.remark-slide-number {
  display: none;
}
.remark-slide-content > h3 {
  text-align: left !important;
  font-size: 28px;
  color: red;
}
a {
  color: red;
  text-decoration: none; /* turns off background coloring of links */
}
</style>

## Presentation online

### **https://hlageek.github.io/reports/lyon2019/figuresdestyle.html**

???

Initial slide, before title

---
background-image: url(img/logos.png)
background-position: 95% 50%
background-size: 10%
class: inverse

<style>
.remark-slide-content > h3 {
  text-align: left !important;
  color: white;
}
</style>

#

# The Effect of Fiction on the Clustering of Sociological Texts

@ *Les figures de style en sciences humaines et sociales* (March 8, 2019 in Lyon, France)

### Radim Hladík

National Institute of Informatics (Japan)

Institute of Philosophy of the Czech Academy of Sciences (Czech Republic)

##### Contact

email: radim.hladik@fulbrightmail.org

twitter: @hlageek

##### Acknowledgement

COST Conference Grant: IS1404 Evolution of reading in the age of digitisation (E-READ)

---
background-image: url(img/versus.png)
background-position: 50% 30%
background-size: 10%

# Sociology as a "third culture"

.pull-left.center[
#### <center>Sociologists</center>
<img src="img/sociologists.png" height="250">
]

.pull-right.center[
#### <center>Writers</center>
<img src="img/writers.png" height="250">
]

<br><br>

<center>
Who should describe the experience of modernity?
</center>

.right[W. Lepenies (1988)]

???

Lepenies raises the issue as a historical or even a biographical one, noting the interactions between sociologists and writers in the position of public intellectuals. The importance of science in the 19th century was extremely high. (Possibly quite unlike current distrust in expertise.)

---

# Structural position of social sciences

#### ... and sociology in particular

.center[
<img src="img/bourdieu.png" height="280">
]

.right[adapted from P.
Bourdieu (1988: 122)]

*“Any sociologists whose exaggerated concern with linguistic finesse might threaten their status as scientific researchers can resist this, more or less consciously, by rejecting literary elegance and draping themselves in the trappings of scientificity (graphs, statistical tables, even mathematical formalism, etc.).”* (Bourdieu, 1988: 29)

???

As sociologists, we expect this structured nature of sociological writing to have consequences.

---
background-image: url(img/versus.png)
background-position: 50% 30%
background-size: 10%

# Bifurcated publication cultures

.pull-left.center[
#### <center>Books</center>
<img src="img/books.jpg" height="200">
]

.pull-right.center[
#### <center>Articles</center>
<img src="img/journals.jpg" height="200">
]

+ Wolfe (1990) speaks of “two faces” of sociology, or **institutionalized cultures**
  - private/public, east(west)/midwest+south, urban/rural, awards/chairs
+ Pontille (2003) shows **national differences**
  - France/USA
+ Clemens et al. (1995) describe complex (gender, elite status) but notable **partial genre differentiation** in career pathways based primarily on *evidentiary base*
  - qualitative/quantitative

---

# Bibliometric reflexions

+ Sociology as a **"paradigmatic social science"** (Hicks 2004) is difficult to measure by standard citation indicators, which are biased in favor of journal publications
  - psychology and economics have citation cultures similar to the sciences
  - history, anthropology, the humanities are book-oriented
  - sociology is split
+ Peculiarities of sociological citation culture(s)
  - books are more frequently cited, but most citations to them come from outside of sociology (Clemens et al. 1995)
  - books are mostly qualitative, but the most cited books are quantitative (Clemens et al. 1995)
  - evidence of increasing "articlisation" (Renisio and Paye 2017)
  - long citation windows (Leydesdorff et al.
2016)
  - overciting foundational literature compared to both the natural sciences and the humanities (Hargens 2000)

???

Clemens et al. on citations: Books are more frequently cited by other fields. Most cited books were quantitative.

Hicks speaks of 4 social science literatures:
- books/journals/national/non-academic

---

# Implications for collaboration patterns

<br>

+ Teams and collaborative work & authorship have become a standard in science (Wuchty, Jones and Uzzi 2007)
  - in the natural sciences in the 21st century, single authorship is more of a rarity
  - **co-authorship rising across the social sciences**, not so much in the humanities
+ Mixed information about the situation in sociology
  - sociology and psychology among the most collaborative social sciences (Babchuk, Keith, and Peters 1999)
  - co-authorship in sociology is the norm (> 50%), but the size of the teams is not increasing as much as in other social sciences (Henriksen 2016)
  - inside sociology, some subfields are more likely to work in teams and, in general, quantitative work is a good predictor of collaboration (Moody 2004)
  - Leahey and Reikowsky (2008) suggest that the collaborative practices in sociology are driven more by productivity concerns than by methodological or subject matter exigencies

???

Regardless of what the actual situation is inside sociology, the important thing is the clear model in the natural sciences (and the resistance towards the model in the humanities).

---

# Genre and gender

<br>

+ Feminist criticism has repeatedly noted the affinity between the abstracted discourse of science and a male-centric world view (Haraway 1991, Smith 1990)
  - in contrast, women authors in sociology adopt situated viewpoints and prefer qualitative approaches (Grant, Ward, and Rong 1987; Oakley 1998; Platt 2007)
  - in the US, sociology is now a predominantly female discipline (Hur et al.
2017)
  - no conclusive research on a gender bias in citations; in some subfields, women can receive a citation premium (Ward, Gast, and Grant 1992)
  - women often adopt the book publication culture; some journals remain dominated by men (Karides et al. 2001)

???

---

# Writing in sociology as social action

<br>

+ Research on sociological **publications** reveals persisting **social boundaries**, but does not examine **writing** as a **social action** of its own.
+ Some books are, actually, collections of articles (Wolfe 1990).
+ Abend et al. (2013) conducted a content analysis of ethnographic articles:
  - Ethnographic research published in American generalist sociological journals employs causal language and rhetorical devices typical of quantitative articles.
  - In Mexican journals, ethnography builds on interpretation and "shedding light".
+ Rhetorical scholars (Bazerman 1988) have emphasized that language actively participates in making knowledge claims.

???

Existing literature studies publications, but not writing, yet writing itself is an epistemological statement, if you will, the form of which shapes the knowledge claims made in the scientific communication.

---

# Hypotheses for validation

In the structure of the sociological publication field, authors take positions through writing. The primary distinction is not between books and articles, or between a qualitative and a quantitative evidentiary base, but between **scientific** and **literary** sociology.

+ H1: Articles classified as “scientific sociology” have more authors per paper than articles classified as “literary sociology”.
+ H2: Articles classified as “scientific sociology” receive more citations per paper than articles classified as “literary sociology”.
+ H3: In articles classified as “scientific sociology” men appear as first authors more frequently than in articles classified as “literary sociology”.

???

Taking a position by conforming to a certain style and deploying specific rhetoric.
So, for example, the "articlisation" of social sciences may follow the model of publication in the natural sciences, without changing the intellectual structure of social sciences.

How can we tell if a certain style of writing in sociology is scientific or literary?

---

# Design

How can we tell if a piece of sociological writing follows literary or scientific style?

1. Combine sociological and literary corpora.
2. Use exploratory methods (clustering) over **textual features** of sociological articles.
3. Validate the clusters against **social features** of sociological articles.

???

There are two obvious ways to approach the classification problem:

1) More traditional linguistics: Identify traits that have been associated with scientific writing (e.g. hedges).

2) Computational linguistics: Build a machine learning classifier after experts prepare a training set.

3) The alternative pursued here: an unsupervised method that admits that we do not know what makes the difference, but takes advantage of the fact that we know where to look for the traits of literary style outside of sociology: in literature!

---

# Getting the data

<br>

.center[
<img src="img/data_sources.png" height="200">
]

+ Czech Sociological Review
  - core, generalist journal of Czech sociology (the only Czech sociological journal in WoS)
  - scraping the website for fulltexts and metadata
  - 522 articles selected
+ Short stories extracted from a Large Corpus of Written Czech v. 4
+ Web of Science
+ enriching the data (number of authors, gender, citations, part-of-speech tagging)

???

Data required a lot of cleaning and some OCRed texts have quite a few errors.

Selected articles: originally written in Czech (no translations) and in the paper category, checked manually where unclear.

499 (95%) matched with WoS records.
---

# Data overview

Corpus|Documents|Tokens|Lemmas|Years
------|---------|------|------|-----
Short fiction|153|7977791|79401|1991-2014
CSR articles|522|2466746|46345|1993-2016

.pull-left[
<img src="img/bagofverbs.png" height="300">
]

.pull-right[
<br>
<br>
**Bag of (shared) verbs**

As early as the 17th and 18th centuries, scientific communication “was not as ‘verbally’ diverse as one would expect from a literary prose” (Gross, Harmon, and Reidy 2002, 79).
]

---

# Dimensionality reduction

1) 2 corpora, 675 documents

2) Document-term matrix of 4650 shared verbs with TF-IDF scores

3) Retain only 33 non-sparse verbs (> 35% of documents)

4) Principal component analysis - 5 components

.center[
<img src="img/pca_sparsity.png" height="400">
]

???

The trade-off between sparsity reduction and explained variance guided the selection of the 35% threshold.

Function words, used in stylometry, did not work for the task. They are effective in distinguishing small differences within a body of similar work. Content-bearing words also do not work for obvious reasons.

The question of rhetoric and style was thus recast as the manner of use, or relative importance, of the 33 most frequent shared verbs.
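Steps 2) and 3) can be sketched roughly as follows. This is a minimal illustration on toy data, not the code behind the slides: the weighting (relative term frequency times `log(N / df)`), the function names `tfidf_rows` and `drop_sparse`, and the four toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocab, doc_freq, rows) for a list of tokenised documents."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each verb occurs.
    doc_freq = Counter(verb for doc in docs for verb in set(doc))
    vocab = sorted(doc_freq)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        # Assumed weighting: relative term frequency * log(N / df).
        rows.append([
            counts[verb] / length * math.log(n_docs / doc_freq[verb])
            for verb in vocab
        ])
    return vocab, doc_freq, rows


def drop_sparse(vocab, doc_freq, rows, n_docs, min_share=0.35):
    """Keep only verbs that occur in more than min_share of documents."""
    keep = [i for i, verb in enumerate(vocab)
            if doc_freq[verb] / n_docs > min_share]
    kept_vocab = [vocab[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_vocab, kept_rows


# Toy documents, each already reduced to verb lemmas (hypothetical data).
docs = [
    ["moci", "říci", "vidět"],
    ["moci", "lze", "uvést"],
    ["moci", "říci", "chtít"],
    ["lze", "uvést", "moci"],
]
vocab, doc_freq, rows = tfidf_rows(docs)
kept_vocab, kept_rows = drop_sparse(vocab, doc_freq, rows, len(docs))
print(kept_vocab)
```

With this particular weighting, a verb that occurs in every document gets a weight of zero. The PCA of step 4) would then run on the retained columns and is not shown here.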
---

# PCA results graph

.center[
<img src="img/pca_sparsity.png" height="400">
]

---

# PCA results table

.pull-left[

Verb|PC1|PC2
----|----|----|
can/be able (moci)||
say/tell (říci)|-0.466|
want (chtít)|-0.411|0.225
must (muset)|-0.63|
see (vidět)|-0.525|0.114
begin (začít)|-0.196|0.252
give (dát)|-0.477|0.109
attempt/strive (snažit)||0.422
show/demonstrate (ukázat)|0.133|-0.566
mean (znamenat)||-0.303
belong (patřit)|0.485|
]

.pull-right[

Verb|PC1|PC2
----|----|----|
exist (existovat)|0.211|-0.198
represent/imagine (představovat)|0.234|0.252
assume (předpokládat)||-0.409
consider (považovat)|0.507|
be valid/pay (platit)||-0.527
create (tvořit)|0.364|
dedicate/devote (věnovat)|0.506|0.184
answer/correspond (odpovídat)|0.344|-0.368
show/indicate (ukazovat)||-0.545
can/be possible (lze)|0.332|-0.256
state/mention (uvést)|0.566|-0.352
]

---

# PCA results table

.pull-left[

Verb|PC1|PC2
----|----|----|
can/be able (moci)||
say/tell (říci)|-0.466|
want (chtít)|-0.411|0.225
**must** (muset)|-0.63|
**see** (vidět)|-0.525|0.114
begin (začít)|-0.196|0.252
give (dát)|-0.477|0.109
attempt/strive (snažit)||0.422
**show**/demonstrate (ukázat)|0.133|-0.566
mean (znamenat)||-0.303
belong (patřit)|0.485|
]

.pull-right[

Verb|PC1|PC2
----|----|----|
exist (existovat)|0.211|-0.198
represent/imagine (představovat)|0.234|0.252
assume (předpokládat)||-0.409
consider (považovat)|0.507|
be valid/pay (platit)||-0.527
create (tvořit)|0.364|
dedicate/devote (věnovat)|0.506|0.184
answer/correspond (odpovídat)|0.344|-0.368
show/indicate (ukazovat)||-0.545
**can/be possible** (lze)|0.332|-0.256
state/mention (uvést)|0.566|-0.352
]

---

# Clustering tree

.pull-left[
<img src="img/hc_tree.png" height="450">
]

.pull-right[
Hierarchical clustering with Ward method over 5 PCs.
Hyperparameter `\(k=3\)`

<br>
<br>

Cluster|Fiction|Sociology
----|-----|------
1|153|22
2|0|287
3|0|213
]

---

# Clusters in the PC space

.pull-left[
<img src="img/hc_pca.png" height="400">
]

.pull-right.center[
<img src="img/clusters_gif.gif" height="350">
]

---

# Literary/scientific sociology & gender

<br>

.center[

Category|Male authors|Female authors|Proportion of female authors (%)|Total
----|-----|-------|------|------|
Scientific sociology|**132**|**81**|38.03|213
Literary sociology|**225**|**84**|27.18|309
Total|357|165|**31.61**|522
]

???

The two-by-two table shows a statistically significant difference.

The direction of the relationship is different from the expected one. We have seen that gender is generally a tricky variable in the analysis of publication cultures.

Proposed explanation: To publish in a male-dominated venue, female authors adopt scientific style.

---

# Literary/scientific sociology & team

.center[
<img src="img/collab.png" height="500">
]

???

Statistically significant difference (Wilcoxon-Mann-Whitney test).

---

# Literary/scientific sociology & citations

.center[
<img src="img/citations.png" height="500">
]

???

Articles with 0 or 1 citation account for more than 53% of articles in the literary sociology category (cf. 37% in scientific sociology).

Average citations per article: 2.4 in literary sociology, 4.06 in scientific sociology.

---

# Literary/scientific sociology & semantics

The most frequent nouns in each group, in descending order of frequency.

Scientific sociology|Literary sociology|Intersection
--------|--------|--------|
model (model)|sociology (sociologie)|year (rok)
respondent (respondent)|problem (problém)|analysis (analýza)
education (vzdělání)|life (život)|research (výzkum)
number (počet)|theory (teorie)|woman (žena)
result (výsledek)|child (dítě)|job/work (práce)
value (hodnota)|process (proces)|human/people (člověk)
data (data)|state (stát)|case (případ)
difference (rozdíl)|way/manner (způsob)|country (země)
variable (proměnná)|space (prostor)|group (skupina)
table (tabulka)|situation (situace)|rate/extent (míra)
man (muž)|city (město)|relationship (vztah)

---

# Literary/scientific sociology & semantics (continued)

.pull-left[

Scientific sociology|Literary sociology|Intersection
--------|--------|--------|
elections/choice (volba)|politics (politika)|question (otázka)
income (příjem)|framework (rámec)|society (společnost)
level (úroveň)|change (změna)|system (systém)
family (rodina)|period/epoch (doba)|
proportion (podíl)|structure (struktura)|
]

.pull-right.center[
<img src="img/semantics.png" height="400">
]

???

This result supports the previously mentioned ties of "scientificity" in sociology to quantitative methods.

HOWEVER, we got to this distinction through language rather than counting tables!
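The three columns of the noun tables can be derived along these lines. This is a sketch under assumptions: the function names `top_nouns` and `compare_groups` and the toy lemma lists are hypothetical, and ties in frequency are broken by first occurrence, which may differ from the original analysis.

```python
from collections import Counter


def top_nouns(docs, n=30):
    """Return the n most frequent noun lemmas across a group of documents."""
    counts = Counter(noun for doc in docs for noun in doc)
    return [noun for noun, _ in counts.most_common(n)]


def compare_groups(group_a, group_b, n=30):
    """Split the top-n nouns of two groups into exclusive and shared lists."""
    top_a, top_b = top_nouns(group_a, n), top_nouns(group_b, n)
    shared = set(top_a) & set(top_b)
    return {
        "only_a": [noun for noun in top_a if noun not in shared],
        "only_b": [noun for noun in top_b if noun not in shared],
        "shared": [noun for noun in top_a if noun in shared],
    }


# Toy noun lemmas per article (hypothetical data), with n=3 instead of 30.
scientific = [["model", "data", "rok"], ["model", "data", "rok", "model"]]
literary = [["teorie", "rok", "život"], ["teorie", "rok", "teorie"]]
result = compare_groups(scientific, literary, n=3)
print(result)
```

A smaller shared list means the two groups differ more in their most frequent vocabulary.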
---

# Pseudoexperiment

<br>

.center[

Scenario|Number of authors|Number of citations|Gender of the 1st author|Intersection of 30 most frequent nouns
----|------|-------|-------|------------|
]

---

# Pseudoexperiment 1

<br>

.center[

Scenario|Number of authors|Number of citations|Gender of the 1st author|Intersection of 30 most frequent nouns
----|------|-------|-------|------------|
Random scenario|n.s.|n.s.|n.s.|22
]

---

# Pseudoexperiment 2

<br>

.center[

Scenario|Number of authors|Number of citations|Gender of the 1st author|Intersection of 30 most frequent nouns
----|------|-------|-------|------------|
Random scenario|n.s.|n.s.|n.s.|22
Simple clustering scenario|< 0.001\*\*\*|0.011\*|n.s.|20
]

---

# Pseudoexperiment 3

<br>

.center[

Scenario|Number of authors|Number of citations|Gender of the 1st author|Intersection of 30 most frequent nouns
----|------|-------|-------|------------|
Random scenario|n.s.|n.s.|n.s.|22
Simple clustering scenario|< 0.001\*\*\*|0.011\*|n.s.|20
Original scenario|< 0.001\*\*\*|< 0.001\*\*\*|0.009\*\*|14
]

---

# Compare variance w/ & w/o fiction

.pull-left.center[
<img src="img/withoutfiction.png" height="300">
]

.pull-right.center[
<img src="img/withfiction.png" height="300">
]

---

# In conclusion

+ Combination of sociological and literary corpora changes the way in which clusters are formed in sociology.
+ The results of modified clustering are consistent with the "third culture" theory.
+ Less literary sociology is
  - more likely to be coauthored
  - likely to receive more citations
  - more likely to have a female first author (sic!)
+ Unlike in other studies of sociological publication cultures, the presented approach relies solely on textual features.

???

To put it in simple terms, knowing how the 33 most frequent verbs are used in sociological articles and, simultaneously, knowing how those same verbs are employed in literature, allows us to predict which sociological articles are more likely to be coauthored, receive more citations, and have women as 1st authors.
The last bit may be a result of dealing with a publication culture which is biased in favor of men. This unexpected finding can be explained by the need for women to write in a less literary style if they are to be published in an outlet dominated by men.

We may detect rhetorical performance of scientific language even in non-quantitative articles and ignore the book vs article format.

---
background-image: url(img/logos.png)
background-position: 95% 50%
background-size: 12%

# The end

## The Effect of Fiction on the Clustering of Sociological Texts

**https://hlageek.github.io/reports/lyon2019/figuresdestyle.html**

#### Radim Hladík

National Institute of Informatics (Japan)

Institute of Philosophy of the Czech Academy of Sciences (Czech Republic)

##### Contact

email: radim.hladik@fulbrightmail.org

twitter: @hlageek

##### Acknowledgement

COST Conference Grant: IS1404 Evolution of reading in the age of digitisation (E-READ)